`unicode_data` refactors #147622

Kmeakin · 2025-10-13T00:58:14Z

Minor refactors to unicode_data that occurred to me while trying to reduce the size of the tables. Splitting into a separate PR. NFC

rustbot · 2025-10-13T00:58:17Z

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

rustbot · 2025-10-13T00:58:19Z

r? @joboet

rustbot has assigned @joboet.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

joboet · 2025-10-14T17:23:33Z

src/tools/unicode-table-generator/src/main.rs

+}
+
+fn rustfmt(path: &str) {
+    std::process::Command::new("rustfmt").arg(path).status().expect("rustfmt failed");


Is rustfmt really always in PATH when this command is run? Otherwise, I think it'd be easier to slap a big #[rustfmt::skip] on the mod unicode_data.

The intention is to keep the generated unicode_data.rs readable without having to carefully construct well-formatted code in the metaprogram. For that, we need to run rustfmt.
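To make the PATH concern concrete, here is a minimal sketch that distinguishes "formatter not found" from "formatter ran and failed". The names `FormatOutcome` and `try_format` are hypothetical and are not part of the generator; only `std::process::Command` and `std::io::ErrorKind::NotFound` come from the standard library. Note that `.status().expect(...)` on its own panics only when the process cannot be spawned, not when it exits with a non-zero status.

```rust
use std::io;
use std::process::Command;

/// Hypothetical result type for an attempt to format a generated file.
#[derive(Debug, PartialEq, Eq)]
enum FormatOutcome {
    /// The formatter ran and exited successfully.
    Formatted,
    /// The formatter binary could not be found (for example, not on PATH).
    ToolMissing,
    /// The formatter ran but exited with a failure status.
    Failed(Option<i32>),
}

/// Runs `program path` and classifies the result instead of panicking.
/// `program` is a parameter so a caller could pass an explicit path to rustfmt.
fn try_format(program: &str, path: &str) -> io::Result<FormatOutcome> {
    match Command::new(program).arg(path).status() {
        Ok(status) if status.success() => Ok(FormatOutcome::Formatted),
        Ok(status) => Ok(FormatOutcome::Failed(status.code())),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(FormatOutcome::ToolMissing),
        Err(err) => Err(err),
    }
}
```

A caller could then decide whether a missing formatter should be a hard error or merely a warning that the output is left unformatted.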

rustbot · 2025-10-19T23:42:39Z

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

Instead of `include_str!()`ing `range_search.rs`, just make it a normal module under `core::unicode`. This means the same source code doesn't have to be checked in twice, and it plays nicer with IDEs. Also rename it to `rt` since it includes functions for searching the bitsets and case conversion tables as well as the range representation.

Remove `#[rustfmt::skip]` from all the generated modules in `unicode_data.rs`. This means we won't have to worry so much about getting indentation and formatting right when generating code. For now, some tables that would be too big when formatted by `rustfmt` remain exempt.
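As a rough illustration of exempting a single table rather than a whole module, the sketch below puts `#[rustfmt::skip]` on one item only. `SMALL_TABLE` and `table_sum` are hypothetical stand-ins and do not appear in `unicode_data.rs`.

```rust
/// Hypothetical table. The attribute tells rustfmt to leave this item's
/// layout alone, while the rest of the file is still formatted normally.
#[rustfmt::skip]
static SMALL_TABLE: [u8; 16] = [
    0, 1, 2, 3, 4, 5, 6, 7,
    8, 9, 10, 11, 12, 13, 14, 15,
];

/// Ordinary code next to the table; rustfmt still formats this function.
fn table_sum() -> u32 {
    SMALL_TABLE.iter().map(|&b| u32::from(b)).sum()
}
```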

The check that `first_code_point` is non-ASCII was made redundant (it will always be true) when we removed all ASCII characters from the tables (rust-lang@a8c6694).

To make the final output code easier to see:
* Get rid of the unnecessary line-noise of `.unwrap()`ing calls to `write!()` by moving the `.unwrap()` into a macro.
* Join consecutive `write!()` calls using a single multiline format string.
* Replace `.push()` and `.push_str(format!())` with `write!()`.
* If after doing all of the above, there is only a single `write!()` call in the function, just construct the string directly with `format!()`.
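The steps in this commit message can be sketched roughly as follows. The macro name `emit!` and the functions `emit_table` and `emit_const` are hypothetical, chosen for illustration; the generator's actual macro and output format may differ.

```rust
use std::fmt::Write as _;

/// Hypothetical wrapper: `write!` into a `String` and unwrap in one place,
/// so call sites do not each carry an `.unwrap()`.
macro_rules! emit {
    ($dst:expr, $($arg:tt)*) => {
        write!($dst, $($arg)*).unwrap()
    };
}

/// Several writes into one buffer, using the macro instead of
/// `.push_str(&format!(..))` or `write!(..).unwrap()`.
fn emit_table(name: &str, values: &[u32]) -> String {
    let mut out = String::new();
    emit!(out, "pub static {name}: [u32; {len}] = [", len = values.len());
    for v in values {
        emit!(out, "{v}, ");
    }
    emit!(out, "];\n");
    out
}

/// When a function would contain only a single write, build the string
/// directly with `format!` instead.
fn emit_const(name: &str, value: u32) -> String {
    format!("pub const {name}: u32 = {value};\n")
}
```

Writing to a `String` through `std::fmt::Write` cannot fail in practice, which is why centralizing the `.unwrap()` in a macro loses nothing.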

Instead of generating a standalone executable to test `unicode_data`, generate normal tests in `coretests`. This ensures tests are always generated and are run as part of the normal test suite. Also change the generated tests to loop over lookup tables, rather than generating a separate `assert_eq!()` statement for every codepoint. The old approach produced a massive file (more than 20,000 lines) which took minutes to compile!
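A minimal sketch of the table-driven style described here, under stated assumptions: `RANGES`, `lookup`, and `check_against_ranges` are hypothetical stand-ins, not the generated code in `coretests`. The idea is that one loop compares the function under test against an expected table, instead of one assertion per codepoint.

```rust
use std::cmp::Ordering;

/// Hypothetical expected data: sorted, inclusive codepoint ranges.
const RANGES: &[(char, char)] = &[('\u{370}', '\u{373}'), ('\u{3b1}', '\u{3c9}')];

/// Hypothetical function under test: binary search over the ranges.
fn lookup(c: char) -> bool {
    RANGES
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                Ordering::Greater // this range lies above `c`
            } else if c > hi {
                Ordering::Less // this range lies below `c`
            } else {
                Ordering::Equal // `c` falls inside this range
            }
        })
        .is_ok()
}

/// Walks every Unicode scalar value once and compares `f` against the
/// expected ranges. Returns the first mismatching character, if any.
fn check_against_ranges(f: impl Fn(char) -> bool, ranges: &[(char, char)]) -> Result<(), char> {
    for c in '\0'..=char::MAX {
        let expected = ranges.iter().any(|&(lo, hi)| lo <= c && c <= hi);
        if f(c) != expected {
            return Err(c);
        }
    }
    Ok(())
}
```

In a real test module the check would sit inside a `#[test]` function; the amount of code to compile stays constant no matter how many codepoints the table covers.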

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 13, 2025

rustbot assigned joboet Oct 13, 2025

Kmeakin force-pushed the km/unicode-data/refactors branch from 2c5244e to 90adbe2 Compare October 13, 2025 01:07

bjorn3 reviewed Oct 13, 2025

library/core/src/unicode/rt.rs (outdated, resolved)

Kmeakin force-pushed the km/unicode-data/refactors branch from 90adbe2 to 1a646cf Compare October 13, 2025 20:31

joboet reviewed Oct 14, 2025

Kmeakin force-pushed the km/unicode-data/refactors branch 3 times, most recently from 1b56c98 to dc0dcf5 Compare October 19, 2025 23:42

rustbot added the T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) label Oct 19, 2025

Kmeakin added 4 commits October 20, 2025 01:38

refactor: remove check that first_code_point is non-ascii

6f10a51

This check was made redundant (it will always be true) when we removed all ASCII characters from the tables (rust-lang@a8c6694).

Kmeakin force-pushed the km/unicode-data/refactors branch from dc0dcf5 to d2c9773 Compare October 20, 2025 01:02

Kmeakin force-pushed the km/unicode-data/refactors branch from d2c9773 to 41d988f Compare October 20, 2025 01:39
